Contents

  • Reviews
  • Introduction
  • Inference as Optimization
  • Expectation Maximization
  • MAP Inference and Sparse Coding
  • Variational Inference and Learning
    • Discrete Latent Variables
    • Calculus of Variations
    • Continuous Latent Variables
    • Interactions between Learning and Inference
  • Learned Approximate Inference
    • Wake-Sleep
    • Other Forms of Learned Inference

Reviews

Restricted Boltzmann Machine (Ch 16)

  • Energy-based Model
    • $\tilde{p}(\mathbf{x})=\exp(-E(\mathbf{x}))$

Discrete Case of RBM

  • All $\mathbf{h}_i$, $\mathbf{v}_i$ are 0 or 1

  • We can extract closed form $p(\mathbf{h}|\mathbf{v})$, $p(\mathbf{v}|\mathbf{h})$
  • But how do we get $p(\mathbf{v})$?

Intractability of computing Partition Functions, Again

  • Computing $\tilde{p}(\mathbf{v})$ is easy
  • But how do we compute $p(\mathbf{v}) = \cfrac{1}{Z} \tilde{p}(\mathbf{v})$?
  • We need to compute the partition function $Z$

  • Given: visible data

    • $\mathbf{x}^{(i)}$
  • "Visible" and hidden variables generated by Gibbs sampling
    • $\tilde{\mathbf{x}}^{(i)}$, $\mathbf{h}^{(j)}$
  • Parameters: $W$, $\mathbf{b}$, $\mathbf{c}$
    • We need the gradients with respect to these (a sketch follows below)
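
A minimal sketch of this review, assuming a toy binary RBM (sizes, values, and helper names such as `cd1_gradients` are my own): the conditionals $p(\mathbf{h}|\mathbf{v})$ and $p(\mathbf{v}|\mathbf{h})$ are closed-form, while the gradient term coming from $Z$ is approximated with one step of Gibbs sampling (CD-1).

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Toy binary RBM with energy E(v, h) = -b.v - c.h - v^T W h (sizes are arbitrary).
nv, nh = 6, 3
W = 0.01 * rng.standard_normal((nv, nh))
b = np.zeros(nv)   # visible biases
c = np.zeros(nh)   # hidden biases

def p_h_given_v(v):
    """Closed-form conditional p(h_j = 1 | v)."""
    return sigmoid(c + v @ W)

def p_v_given_h(h):
    """Closed-form conditional p(v_i = 1 | h)."""
    return sigmoid(b + W @ h)

def cd1_gradients(v0):
    """CD-1 estimate of the log-likelihood gradients for W, b, c:
    positive phase from the data, negative phase from one Gibbs step
    (a cheap stand-in for the intractable expectation that involves Z)."""
    ph0 = p_h_given_v(v0)
    h0 = (rng.random(nh) < ph0).astype(float)   # sample h ~ p(h | v0)
    pv1 = p_v_given_h(h0)
    v1 = (rng.random(nv) < pv1).astype(float)   # sample v~ ~ p(v | h0)
    ph1 = p_h_given_v(v1)
    dW = np.outer(v0, ph0) - np.outer(v1, ph1)
    db = v0 - v1
    dc = ph0 - ph1
    return dW, db, dc

v_data = (rng.random(nv) < 0.5).astype(float)   # a stand-in observed x^(i)
dW, db, dc = cd1_gradients(v_data)
```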

Introduction

The challenge of inference usually refers to the difficult problem of computing $p(h|v)$ or taking expectations with respect to it.

  • $p(v)$, $p(h|v)$, and $p(v|h)$ are the key quantities for the explanations that follow

19.1 Inference as Optimization

  • What we want to know: $\log p(v;\theta)$
    • Computing $\log p(v;\theta)$ exactly is intractable
    • Instead we work with a lower bound $\mathcal{L}$ (the evidence lower bound)
    • Consider hidden variables $h$
    • Introduce a new distribution $q(h|v)$ over them
    • $\mathcal{L}(v,\theta,q) = \log p(v;\theta) - D_{KL}(q(h|v)\|p(h|v;\theta))$

  • The tighter the bound $\mathcal{L}$ is
    • the better $q(h|v)$ approximates $p(h|v)$
    • $q(h|v) = p(h|v)$ => $\mathcal{L} = \log p(v;\theta)$
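
A tiny numeric check of this identity on a made-up joint $p(v,h)$ over two binary variables (all numbers are arbitrary toy values): for any $q(h|v)$, the bound $\mathcal{L} = \mathbb{E}_{h\sim q}[\log p(v,h)] + H(q)$ satisfies $\log p(v) = \mathcal{L} + D_{KL}(q\|p(h|v))$, so it is tight exactly when $q$ equals the true posterior.

```python
import numpy as np

# Toy joint p(v, h) over two binary variables (rows: v = 0, 1; cols: h = 0, 1).
p_joint = np.array([[0.30, 0.10],
                    [0.15, 0.45]])
v = 1                                  # the observed value
p_v = p_joint[v].sum()                 # exact marginal p(v)
p_h_given_v = p_joint[v] / p_v         # exact posterior p(h | v)

q = np.array([0.7, 0.3])               # any approximate posterior q(h | v)

# Evidence lower bound: L = E_{h~q}[log p(v, h)] + H(q)
elbo = np.sum(q * np.log(p_joint[v])) - np.sum(q * np.log(q))
kl = np.sum(q * np.log(q / p_h_given_v))

print(np.log(p_v), elbo + kl)          # identical: log p(v) = L + KL(q || p)
assert np.isclose(np.log(p_v), elbo + kl)
```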

19.2 Expectation Maximization

  • A popular training algorithm for models with latent variables
    • e.g., k-means clustering
  • EM is not itself an approach to approximate inference
  • Rather, it is an approach to learning with an approximate posterior
  • Stochastic gradient ascent on latent variable models can be seen as a special case of the EM algorithm
    • where the M step consists of taking a single gradient step.
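
As a concrete toy illustration (the data and model choices below are my own), a minimal EM sketch for a two-component Gaussian mixture with unit variances and equal weights: the E step computes the exact posterior $q(h|x)$ over the latent component assignment, and the M step maximizes $\mathbb{E}_{h\sim q}\log p(x,h)$ over the means in closed form (replacing the M step with a single gradient step would recover the stochastic-gradient view mentioned above).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two unit-variance Gaussians with equal weights.
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

mu = np.array([-1.0, 1.0])   # initial guesses for the two component means

for _ in range(50):
    # E step: exact posterior over the latent component (tractable here).
    logp = -0.5 * (x[:, None] - mu[None, :]) ** 2        # log N(x; mu_k, 1) + const
    resp = np.exp(logp - logp.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # M step: maximize E_{h~q}[log p(x, h)] over the means in closed form.
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)   # roughly recovers (-2, 3)
```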

19.3 MAP Inference and Sparse Coding

19.4 Variational Inference and Learning

  • Why "variational"?

    • The optimization is over a function, not just a vector of parameters
    • That function is $q$ in $\mathcal{L}(\mathbf{v}, \theta, q)$
  • The core idea

    • Maximize $\mathcal{L}$ over a restricted family of distributions $q$
      • Do not just control $\theta$
      • Also control $q$, chosen from a restricted family
    • The restriction on $q$ is what makes the optimization tractable
  • Mean-field approach

    • Restrict $q$ to factorize: $q(\mathbf{h}|\mathbf{v}) = \prod_{i}q(\mathbf{h}_i|\mathbf{v})$
    • and maximize $\mathcal{L} = \log p(\mathbf{v};\theta)-D_{KL}(q(\mathbf{h}|\mathbf{v})\|p(\mathbf{h}|\mathbf{v};\theta))$

19.4.1 Discrete Latent Variables

Binary sparse coding model example

  • $\mathbf{h}_i$ is binary.
  • set $\hat{h}_i = q(\mathbf{h}_i=1|\mathbf{v})$
    • $1 - \hat{h}_i = q(\mathbf{h}_i=0|\mathbf{v})$
  • $p(h_i = 1) = \sigma(b_i)$
  • $p(v|h) = \mathcal{N}(v;Wh, \beta^{-1})$

  • Target

    • The marginal likelihood $p(v)$
    • $h$ enters through the prior
      • each $h_i$ is 1 with probability $\sigma(b_i)$
      • so $b_i$ is a key parameter

  • Find $b_i$ to maximize $p(v)$
    • This requires $p(h|v)$
  • $p(h|v) \approx q(h|v) = \prod_i q(h_i|v)$
    • Replace $p(h|v)$ with $\prod_i q(h_i|v)$
    • set $\hat{h}_i = q(\mathbf{h}_i=1|\mathbf{v})$
    • so $1 - \hat{h}_i = q(\mathbf{h}_i=0|\mathbf{v})$

  • Fixed-point update equations
    • Set $\cfrac{\partial}{\partial \hat{h}_i} \mathcal{L}(v, \theta, \hat{h}) = 0$
    • This yields updates of the form $\hat{h}_i^{(t)} = f\big(\hat{h}_j^{(t-1)}\big)$, $\hat{h}_j^{(t)} = f\big(\hat{h}_i^{(t-1)}\big)$
    • Iterating these updates converges to a (local) optimum of $\mathcal{L}$
  • The iteration behaves like a recurrent network (RNN)
    • Each $\hat{h}_i$ is recomputed from the other $\hat{h}_j$ (see the sketch below)
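
A sketch of the resulting fixed-point iteration for the binary sparse coding model above, with toy sizes and a scalar precision $\beta$ for simplicity. The update below is the mean-field form obtained from $\partial\mathcal{L}/\partial\hat{h}_i = 0$ for this model; since the derivation is not reproduced in these notes, treat the exact expression as an assumption rather than something established here. Each $\hat{h}_i$ is recomputed from the current values of the other $\hat{h}_j$, which is the recurrent-network structure noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy binary sparse coding model: p(h_i = 1) = sigmoid(b_i),
# p(v | h) = N(v; W h, beta^{-1} I). Sizes and values are arbitrary.
nv, nh = 5, 4
W = rng.standard_normal((nv, nh))
b = rng.standard_normal(nh)
beta = 1.0                       # scalar precision for simplicity
v = rng.standard_normal(nv)

h_hat = np.full(nh, 0.5)         # initial q(h_i = 1 | v)

for _ in range(100):             # sweep the fixed-point equations until convergence
    for i in range(nh):
        # contribution of the other units j != i, using their current h_hat_j
        cross = beta * W[:, i] @ (W @ h_hat - W[:, i] * h_hat[i])
        h_hat[i] = sigmoid(b[i]
                           + beta * v @ W[:, i]
                           - 0.5 * beta * W[:, i] @ W[:, i]
                           - cross)

print(h_hat)                     # converged mean-field marginals q(h_i = 1 | v)
```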

19.4.2 Calculus of Variations

19.4.3 Continuous Latent Variables

  • Calculus of variations lets us find the $\tilde{q}$ that maximizes $\mathcal{L}(v,\theta, q)$
  • The derivation below follows Kevin Murphy's book

  • Assumptions for simplicity
    • $h \in \mathbb{R}^2$, so $i = 1,2$
    • $p(h) = \mathcal{N}(h;0, \mathbf{I})$
    • $p(v|h) = \mathcal{N}(v;w^\top h,1)$
  • Work with the compatibility function $\tilde{p}$ (the unnormalized probability)

  • Reduce Notations
    • $\langle h_2 \rangle=\mathbb{E}_{h \sim q(h|v)}[h_2]$
    • $\langle h^2_2 \rangle=\mathbb{E}_{h \sim q(h|v)}[h^2_2]$
  • Since $\tilde q$ turns out to be Gaussian, $q$ is also Gaussian
    • so set $q=\mathcal{N}(h;\mu, \beta^{-1})$
    • Find $\mu, \beta$ with a conventional optimization method (see the sketch below)
    • $\mu, \beta$ are the parameters of the variational approximation
    • $w$ is a parameter of the learning (generative) model
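
A minimal sketch for this example (the values of $w$ and $v$ are arbitrary): the exact posterior $p(h|v)$ is Gaussian with precision $\Lambda = I + ww^\top$ and mean $\Lambda^{-1}wv$, and coordinate ascent on the mean-field factors $q(h_i|v) = \mathcal{N}(h_i;\mu_i,\beta_i^{-1})$ fixes $\beta_i = \Lambda_{ii}$ while updating each $\mu_i$ from the other mean.

```python
import numpy as np

# Toy continuous-latent model from above: h in R^2,
# p(h) = N(h; 0, I), p(v | h) = N(v; w^T h, 1). Values are arbitrary.
w = np.array([1.0, -0.5])
v = 2.0

# Exact posterior p(h | v): Gaussian with precision Lam and mean Lam^{-1} w v.
Lam = np.eye(2) + np.outer(w, w)
mu_exact = np.linalg.solve(Lam, w * v)

# Mean-field q(h | v) = q(h_1 | v) q(h_2 | v): each factor is Gaussian
# N(h_i; mu_i, 1 / Lam_ii), i.e. beta_i = Lam_ii. Coordinate ascent on the means:
mu = np.zeros(2)
for _ in range(50):
    for i in range(2):
        j = 1 - i
        mu[i] = (w[i] * v - Lam[i, j] * mu[j]) / Lam[i, i]

# The mean-field means match the exact posterior mean; the per-factor
# variances 1 / Lam_ii underestimate the exact marginal variances.
print(mu, mu_exact)
```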

Overall Process

main-loop for gradient update

  1. approximate inference-loop

    update $q_i$s

  2. MCMC loop for the partition function

    sampling

  3. update gradient
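
A structural sketch of this loop only; every helper below is a hypothetical placeholder for model-specific code (not a real API), and the point is just the order of the three steps.

```python
def approximate_inference(theta, v):        # inner loop 1: update the q_i's
    ...                                     # e.g. the mean-field fixed-point sweep of 19.4.1

def mcmc_samples(theta):                    # inner loop 2: samples for the log Z gradient
    ...                                     # e.g. Gibbs sampling as in the RBM review

def gradient(theta, v, q, samples):         # positive phase (uses q) minus negative phase
    ...

def training_step(theta, v, learning_rate=0.01):
    q = approximate_inference(theta, v)     # 1. approximate inference loop
    samples = mcmc_samples(theta)           # 2. MCMC loop for the partition function
    return theta + learning_rate * gradient(theta, v, q, samples)   # 3. gradient update
```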

19.4.4 Interactions between Learning and Inference

  • Approximate inference <-> the learning process: the two interact
  • The final goal is to maximize $\log p(v;\theta)$
  • The intermediate goal is to maximize $\mathbb{E}_{h\sim q} \log p(v,h)$
  • A modality mismatch (e.g., a unimodal $q$ fit to a multimodal $p(h|v)$) can produce a poor approximation
  • To check approximation quality, compute the gap between $\log p(v;\theta)$ and $\mathcal{L}(v, \theta, q)$ (the gap equals $D_{KL}(q(h|v)\|p(h|v))$)

19.5 Learned Approximate Inference

  • Optimization via iterative procedures such as fixed-point equations is often very expensive and time-consuming.
  • Instead, learn an inference network $\hat{f}(v;\theta) \approx q$ that predicts the variational parameters in a single pass
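
A minimal amortized-inference sketch in the binary-latent setting of 19.4.1 (the network, its name, and all sizes are illustrative assumptions): a single layer maps $v$ to the variational parameters $\hat{h}$ in one forward pass, replacing the per-example fixed-point loop; its weights would be trained alongside the model, e.g. by gradient ascent on $\mathcal{L}$.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Hypothetical one-layer inference network f_hat(v): a single matrix multiply
# stands in for the iterative mean-field fixed-point loop of 19.4.1.
def inference_net(v, W_e, c_e):
    return sigmoid(v @ W_e + c_e)          # predicted h_hat_i = q(h_i = 1 | v)

rng = np.random.default_rng(0)
nv, nh = 5, 4                              # toy sizes
W_e = 0.1 * rng.standard_normal((nv, nh))
c_e = np.zeros(nh)
v = rng.standard_normal(nv)
print(inference_net(v, W_e, c_e))          # approximate posterior in one pass
```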

19.5.1 Wake-Sleep

  • Wake phase -> update $\theta$ (the generative model)
  • Sleep phase -> update $\hat{f}$ (the inference network)

    main-loop (a toy sketch follows after the comparison below)

    1. wake phase

      sample $h$ from $\hat{f}(v)$ on real data $v$; update $\theta$

    2. sleep phase

      sample $(v, h)$ from the generative model; update $\hat{f}$

  • c.f. mean-field approximation loop

    main-loop for gradient update

    1. approximate inference-loop

      update $q_i$s

    2. MCMC loop for the partition function

      sampling

    3. update gradient
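
A toy, runnable wake-sleep sketch under assumed model choices (the specific generative and recognition models here are simple picks of mine, not the algorithm's canonical setup): generative model $p(h_j{=}1)=\sigma(b_j)$, $p(v|h)=\mathcal{N}(Wh,\mathbf{I})$; recognition network $q(h_j{=}1|v)=\sigma((v^\top R)_j + r_j)$. The wake phase infers $h$ with the recognition network on data and updates $(W, b)$; the sleep phase dreams $(h, v)$ from the generative model and updates $(R, r)$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

nv, nh, lr = 6, 3, 0.01
# Generative model theta = (W, b): p(h_j = 1) = sigmoid(b_j), p(v | h) = N(W h, I).
W = 0.1 * rng.standard_normal((nv, nh))
b = np.zeros(nh)
# Recognition network (R, r): q(h_j = 1 | v) = sigmoid(v @ R + r).
R = 0.1 * rng.standard_normal((nv, nh))
r = np.zeros(nh)

data = rng.standard_normal((100, nv))        # stand-in "observed" visible vectors

for v in data:
    # Wake phase: infer h with the recognition net, then raise log p(v, h) wrt theta.
    h = (rng.random(nh) < sigmoid(v @ R + r)).astype(float)
    W += lr * np.outer(v - W @ h, h)         # gradient of log N(v; W h, I) wrt W
    b += lr * (h - sigmoid(b))               # gradient of log p(h) wrt b

    # Sleep phase: dream (h, v) from the generative model, then raise log q(h | v).
    h_s = (rng.random(nh) < sigmoid(b)).astype(float)
    v_s = W @ h_s + rng.standard_normal(nv)
    err = h_s - sigmoid(v_s @ R + r)         # gradient of the Bernoulli log-likelihood wrt logits
    R += lr * np.outer(v_s, err)
    r += lr * err
```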

19.5.2 Other Forms of Learned Inference

  • Learned approximate inference has recently become one of the dominant approaches to generative modeling
    • most prominently in the form of the variational autoencoder (VAE)
